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Abstract. I introduce a Markov chain Monte Carlo (MCMC) scheme in which sampling from a 
distribution with density 7r(x) is done using updates operating on an "ensemble" of states. The current 
state X is first stochastically mapped to an ensemble, (x*^^), . . . ,2;(^)). This ensemble is then updated 
using MCMC updates that leave invariant a suitable ensemble density, p{x^^\ . . . , x^^^) , defined in 
terms of 7r(x^*'*) for i = 1, . . . ,K. Finally a single state is stochastically selected from the ensemble 
after these updates. Such ensemble MCMC updates can be useful when characteristics of vr and the 
ensemble permit 7r(x''*^) for all i G {1, . . . to be computed in less than K times the amount of 
computation time needed to compute 7r(x) for a single x. One common situation of this type is when 
changes to some "fast" variables allow for quick re-computation of the density, whereas changes to other 
"slow" variables do not. Gaussian process regression models are an example of this sort of problem, 
with an overall scaling factor for covariances and the noise variance being fast variables. I show that 
ensemble MCMC for Gaussian process regression models can indeed substantially improve sampling 
performance. Finally, I discuss other possible applications of ensemble MCMC, and its relationship to 
the "multiple-try Metropolis" method of Liu, Liang, and Wong and the "multiset sampler" of Leman, 
Chen, and Lavine. 



Introduction 

In this paper, I introduce a class of Markov chain Monte Carlo methods that utilize a state space that 
is the ET-fold Cartesian product of the space of interest — ie, although our interest is in sampling for 
X in the space we use MCMC updates that operate on {x^^\ . . . ,x^^^) in the space y = . 

Several such methods have previously been proposed — for example, Adaptive Directive Sampling 
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(Gilks, Roberts, and George, 1994) and Parallel Tempering (Geyer, 1991; Earl and Deem, 2005). The 
"ensemble MCMC" methods I introduce here differ from these methods in two fundamental respects. 
First, use of the space y = is temporary — after some number of updates on 3^, one can switch 
back to the space X, perform updates on that space, then switch back to y, etc. (Of course, one 
might choose to always remain in the y space, but the option of switching back and forth exists.) 
Secondly, the invariant density for an ensemble state {x^^\ . . . ,x^^^) in the y space is proportional 
to the sum of the probabilities of the individual x^''^ (or more generally a weighted sum), whereas 
in the previous methods mentioned above, the density is a product of densities for each x^'^\ As a 
consequence, the properties of the ensemble methods described here are quite different from those of 
the methods mentioned above. 

I expect that use of an ensemble MCMC method will be advantageous only when, for the particular 
distributions being sampled from, and the particular ensembles chosen, a computational short-cut 
exists that allows computation of the density for all of x^^\ . . . , x^^^ with less than K times the effort 
needed to compute the density for a single state in X. Without this computational advantage, it is 
hard to see how performing updates on X^ could be beneficial. 

In this paper, I mostly discuss one particular context where such a computational short-cut exists 
— when the distribution being sampled has the characteristic that after changes to only a subset of 
"fast" variables it is possible to re-computc the density in much less time than is needed when other 
"slow" variables change. By using ensembles of states in which the slow variables have the same values 
in all ensemble members, the ensemble density can be quickly computed. In the limit as the size of 
the ensemble grows, it is possible using such a method to approach the ideal of updating the slow 
variables based on their marginal distribution, integrating over the fast variables. 

I apply these fast-slow ensemble MCMC methods to Gaussian process regression models (Rasmussen 
and Williams 2006; Neal 1999) that have a covariance function with unknown parameters, which in a 
fully Bayesian treatment need to be sampled using MCMC. The computations for such models require 
operations on the covariance matrix whose time grows in proportion to n^, where n is the number 
of observations. If these computations are done using the Cholesky decomposition of the covariance 
matrix, a change to only the overall scale of the covariance function does not require recomputation of 
the Cholesky decomposition; hence this scale factor can be a fast variable. If computations are done 
by finding the eigenvectors and eigenvalues of the covariance function, both an overall scale factor and 
the noise variance can be fast variables. I show that ensemble MCMC with either of these approaches 
can improve over MCMC on the original state space. 

I conclude by discussing other possible applications of ensemble MCMC, and its relationship to the 
"multiple-try Metropolis" method of Liu, Liang, and Wong (2000), and to the "multiset sampler" of 
Leman, Chen, and Lavine (2009). 

MCMC with an ensemble of states 

Here, I introduce the idea of using an ensemble of states in general, using the device of stochastically 
mapping from the space of interest to another space. I also discuss several ways of defining an ensemble. 
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I will suppose that we wish to sample from a distribution on X with probability density or mass 
function Tr{x). The MCMC approach is to define a Markov chain that has tt as an invariant distribution. 
Provided this chain is ergodic, simulating the chain from any initial state for a suitably large number 

of transitions will produce a state whose distribution is close to tt. To implement this strategy, we 
need to define a transition probability, T{x'\x), for the chain to move to state x' if it is currently in 
state X. Updates made according to this transition probability must leave tt invariant: 

fm'l.),. = .(X') (1) 



Stochastically mapping to an ensemble and back. An MCMC update that uses an ensemble 
of K states can be viewed as probabilistically mapping from the original space, X, to a new space, 
3^ = , performing updates on this new space, and then mapping back to the original space. We can 
formalize this "temporary mapping" strategy, which has many other applications (Neal 2006; Neal 
2010, Section 5.4), as follows. We define a transition distribution, T{x'\x), on the space X as the 
composition of three other stochastic mappings, T, T, and T: 



X 




That is, starting from the current state, x, we obtain a value in the temporary space by sampling from 
T{y\x). The target distribution for y G 3^ has probability /density function p(y). We require that 

j Tr{x)f{y\x)dx = p{y) (3) 

T{y'\y) defines a transition on the temporary space that leaves p invariant: 

J p{y)f{y'\y)dy = p{y') (4) 

Finally, T gets us back to the original space. It must satisfy 

j p{y')f{xy)dy' = nix') (5) 

The above conditions imply that the overall transition, Tix'\x), will leave tt invariant, and so can be 
used for Markov sampling of vr. 

In ensemble MCMC, where y = X^ , with the ensemble state written as y = (a;(i),...,a;(-^)), the 
stochastic mapping T, from X to y, is defined in terms of an ensemble base measure, a distribution 
with probability density or mass function C{x^^\ . . . , x^^^), as follows: 

1 ^ 

f(x«,...,xW|x) = -J2C-k\k{x^-'^\x)S,ix^'^^) (6) 

fe=i 

Here, Sxi-) is the distribution with mass concentrated at x. The notation x^~'^^ refers to all compo- 
nents of the ensemble state other than the fc'th. The probability density or mass function for the 
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conditional distribution of components other than the fc'th given a value for the /c'th component is 
C-k\kix^~^Mx^''^) = Cix^^\ ■ ■ ■ ,x^^^) I Ckix^''^), where Cfe is the marginal density or mass function for 
the fc'th component, derived from the joint distribution We assume that (kix) is non-zero if tt{x) 
is non-zero. 

Algorithmically, T creates a state in y by choosing an index, k, from 1 to K, uniformly at random, 
setting x^''^ to the current state, x, and finally randomly generating all x^^^ for j ^ k from their 
conditional distribution under given that x^'^^ = x. Note that T depend only on ^, not on vr. 



A simple example: As a simple (but not useful) illustration, let X = [0, 2tt) and set K = 2. Choose 
the ensemble base measure, to be uniform on pairs of points with x^"^^ = x^^^ + oo (mod 27r), for 
some constant oj G [0, 27r). (le, the ensemble base measure is uniform over pairs of points on a unit 
circle with the second point at an angle to counterclockwise from the first.) Then C-i|i(' \x^^^) = 
^xw+u (mod 2^)( • ) and C2|2( ' l^^^^^) = (mod 2^)( ' )• The algorithm for the map f is 

1) pick k uniformly from {1,2} 

2) set x^'^^ = X 

3) if A; = 1, set x^"^^ = x + ui (mod 27r); if A; = 2, set x^-^'^ = x — uj (mod 2-k). 



Having defined T as in equation ([6]), the density function, p, for the ensemble distribution that will 
make condition ([3]) hold can be derived as follows: 

K 



W^-k\k{x^~'^\x^'^)<x^'') (8) 



K 

k=l 



1 JS^cix^^^ 



c(x«,...,.w)1j:^^ (10) 



So we see that the density function of p with respect to the base measure ^ is sometimes simply 
proportional to the sum of vr for all the ensemble members (when the Cfc are all the same uniform 
distribution), and more generally is proportional to the sum of ratios of vr and Cfc- 

The simple example continued: Ci a-nd ^2 are both uniform over [0, 27r). For pairs in satisfying the 
constraint that x^^^ = x^^^ + uj (mod27r), the ensemble density follows the proportionality: 

p(x(^\x(2)) oc 7r(xW) + 7r(x(2)) (11) 

(The probability under p of a pair that violates the constraint is of course zero.) 
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We can let T be any sequence of Markov updates for y = {x^^\ . . . ,x^^^) that leave p invariant 
(condition dH). We denote the result of applying T to y hy y' . 

The simple example continued: We might define T to be a sequence of some pre-defined number 
of Metropolis updates (Metropolis, et al 1953), using a random- walk proposal, which from state 
{x^^\ x^^^) is {x'^}\ xl^'') = (x*-^-* + V (mod27r), x^^^ + v (mod27r)), with v being drawn from a 
distribution symmetrical around zero, such as uniform on (— vr/lO, +7r/10). Such a proposal is accepted 



with probability min[l, (7r(xl^^) + 'k{x\''')) / (7r(x^"^0 + tt{x^^^)\. 



(2)^ 



To return to a single state, x' , from the ensemble state y' = {x'^^\ . . . , x'^^^), we set x' to one of the 
x^^\ which we select randomly with probabilities proportional to Txix^^^) / C,k{x^^^). We can see that 
this T satisfies condition ([5]) as follows: 



p{y')T{x'\y')dy' 



K 



/ 7r(x')C-,|,(x(-^')|x')dx(-^-) 
-ij^7r(x') = vr(xO 



if 

E 



vr 



(xW) 



dx«-- 



•(ix(^) (12) 
(13) 
(14) 
(15) 



T/ie simple example continued: Since Ci and C2 are uniform over [0,27r), f simply picks either x*-^^ or 

.(2)^ 



with probabilities proportional to 7r(x(-'^)) and 7r(x^ 



Some possible ensembles. The ensemble base measure, can be defined in many ways, some 
of which I will discuss here. Note that the usefulness of these ensembles depends on whether they 
produce any computational advantage for the distribution being sampled. This will be discussed in 
the next section for problems with fast and slow variables, where variations on some of the ensembles 
discussed here will be used. 

One possibility is that x*^^^, . . . ,x^^^ are independent and identically distributed under ^, so that 
C(x^^\ . . . , x^^^) = C{x^^^) ■ ■ ■ C{x^^^), where C{x) is the marginal distribution of all the x'-'^-*. In 
the conditional distribution C-k\ki the components other than x*^'^^ will also be independent, with 
distribution ^(x^'^^) the same as their marginal distribution. With this ensemble base measure, the 
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ordering of states in the ensemble is irrelevant, and the mapping, T, from a single state to an ensemble 
consists of combining the current state with K — 1 states sampled from the marginal The mapping, 
f, from the ensemble to a single state consists of randomly selecting a state from x^^^ , ■ ■ ■ , x'^^^ with 
probabilities proportional to ■k{x'^^'^)/C{x'^^'>). 

Rather than the x^^^ being independent in the ensemble base measure, they might just be exchange- 
able. This can be expressed as the x*-*^ being independent given some paramater 0, which has some 
"prior", C,[0) (which is unrelated to the real prior for a Bayesian inference problem). In other words, 

K 

k=i 

The mapping T can be implemented by sampling 6 from the "posterior" density C,{6 \ x^^^ = x) gc 
(^{9) (^(x^'^^ = x\9), for k randomly chosen from 1, . . . ,K (though due to exchangeability, this random 
choice of k isn't really necessary) and then sampling x^^^ for j ^ k independently from (^{x^^^\9) (which 
is the same for all j). The marginal distributions for the x^^'^ in the ensemble base measure, needed 
to compute p and to map from an ensemble to a single state, are of course all the same when this 
measure is exchangeable. 

Another possibility is for the ensemble base measure on x^^\ . . . , x^^^ to be defined by a stationary 
Markov chain, which again leads to the x^^^ all having the same marginal distribution, Ci^)- For 
example, if X is the reals, we might let C{x) = N{x; 0,r^), where N{x; fJ-,T'^) is the normal density 
function, and Cfe+i|A;(2^*-^^^'' k*-*^-*) = N{x^^^^'>; x^^^y^l — l/r^, 1) for k = 1, . . . , K —1. Going from a 
single state to an ensemble is done by randomly selecting a position, k, for the current state, and then 
simulating the Markov chain forward from x^^^ to x^^^ and backward from x*-'^^ to x^^^ (which will be 
the same as forward simulation when the chain is reversible, as in the example here). The ensemble 
density is given by equation (fTOj). with all the marginal densities Cki^) equal to C{x). We return 
from an ensemble to a single state by selecting from x^^\ . . . ,x^^^ with probabilities proportional to 
7r(x('=))/C(xW). 

Ensemble base measures can also be defined using constellations of points. Let xi^\ . . . be 
some set of K points in X. We let ( be the distribution obtained by shifting all these points by an 
amount, 9, chosen from some distribution on X, with density C{9) that is nowhere zero, so that 

K 

c(xw,...,xW) = /c(^)n^.i^-'+e(^^'^)^^ (17) 

k=i 

The conditional distributions C-k\k do not depend on ({9) — they are degenerate distributions in 
which the other constellation points are determined by x^*^^. We therefore move from a single point, x, 
to a constellation of points by selecting k at random from 1, . . . ,K, setting x^'^^ to x, and setting x^-'^ 
for J 7^ fc to xl'''' + X — xl^''. The ensemble density, p{x^^\ . . . ,x^^^), will also not depend on C{9), as 
can be seen from equation ([8])) — it will be proportional to the sum of 7r(x(^'') for ensembles that have 
the shape of the constellation, and be zero for those that do not. The marginal densities, Ck{x^^'^), will 
be the same for all constellation points, since 9 = x^'^-* — xl^^ will be the same for all k. (Note that this 
is not the same as the marginal density functions being the same for all k.) Hence moving to a single 
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point from a constellation is done by choosing a point, x^'^^ from the constellation with probabilities 
proportional to 7r(x^'^^). 

The simple example used in the previous section can be seen as a constellation of two points with 
xl^^ = 0, xl^^ = uj, ("(0) uniform over [0,27r), and addition done modulo 27r. 

In the presentation above, the ensemble base measure C is assumed to be a proper probability 
distribution, but C can instead be an improper limit of proper distributions, as long as the conditional 
distributions and ratios of marginal distributions that are needed have suitable limits. In particular, 
the conditional distributions C-k\k used to define T in equation ([6|) must reach limits that are proper 
distributions; this will also ensure that p is well defined (see equation ([8])). The ratios Cj{x^^^) / Ck{x^^^) 
must also reach limits, so that T will be well defined. 

We can obviously let become improper when defining a constellation ensemble, since we have 
seen that the choice of this density has no effect. For exchangeable ensembles, we can let ({6) 
be improper as long as the "posterior", (^(9\x^^^ = x), is proper. We can also let r go to infin- 
ity in the Markov ensemble base measure defined above. We will then have Ck+ilki^^'^^^^^^^^) = 
N{x^^^^^; x^''\l), so the conditional distributions C-fcjfc ^re simple random walks in both directions 
from x^^\ Since the marginal distributions for the x^^^ are all normal with mean zero and variance r-^, 
for any x^^^ and x^^\ the ratio C{x^^^) / C,{x^^^) approaches one as r goes to infinity. (Note also that 
since the limit of C-fc|fc is proper, the relevant values of x^^^ will not diverge, so uniform convergence 
of these ratios is not necessary.) 

Ensemble MCMC for problems with "fast" and "slow" variables 

Suppose that we wish to sample from a distribution on a space that can be written as a Cartesian 
product, X = Xi X X2, with elements written as x = (xi,X2). Suppose also that the probability 
density or mass function, 7r(xi, X2), is such that re-computing the density after a change to X2 is much 
faster than recomputing the density after a change to xi. That is, if we have computed 7r(xi,X2), and 
saved suitable intermediate results from this computation, we can quickly compute 7r(xi,X2) for any 
Xg, but there is no short-cut for computing 'it{x[,X2) for x[ x\. In general, x\ and X2 might both 
be multidimensional. I refer to x\ as the "slow" variables, and X2 as the "fast" variables. 

I first encountered this class of problems in connection with models for the Cosmic Microwave 
Background (CMB) radiation (Lewis and Bridle, 2002). This is an example of data modelled using a 
complex simulation, in this case, of the early universe. In such problems, the slow variables, xi, are 
the unknown parameters of the simulation, which are being fit to observed data. When new values for 
these parameters are considered, as for a Metropolis proposal, computing vr with these new values will 
require re-running the simulation, which we assume takes a large amount of computation time. Some 
fairly simple model relates the output of the simulation to the observed data, with parameters X2 (for 
example, the noise level of the observations). When a new value for X2 is considered in conjunction 
with a value for xi that was previously considered, the simulation does not need to be re-run, assuming 
the output from the previous run was saved. Instead, computing vr with this new X2 and the old x\ 
requires only some much faster calculation involving the observation model. 
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Fast and slow variables arise in many other contexts as well. For example, in many Bayesian models, 
the posterior density for a set of low-level parameters involves a large number of data items, which 
cannot be summarized in a few sufficient statistics, while a small number of hyperparameters have 
densities that depend (directly) only on the low-level parameters. The low-level parameters in such 
a situation will be "slow", since when they are changed, all the data items must be looked at again, 
but the hyperparameters may be "fast", since when they change, but the low-level parameters stay 
the same, there is no need to look at the data. 

Later in this paper, I will consider applying methods for fast and slow variables to the more spe- 
cialized problem of sampling from the posterior distribution of the parameters of a Gaussian process 
regression model, in which an overall scale factor for the covariance function and the noise variance 
can be regarded as fast variables. 

Previous methods for problems with fast and slow variables. A simple way of exploiting 
fast variables, used by Lewis and Bridle (2002), is to perform extra Metropolis updates that change 
only the fast updates, along with less frequent updates that change both fast and slow variables. 
Lewis and Bridle found that this is an improvement over always performing updates for all variables. 
Similarly, when using a Metropolis method that updates only one variable at a time, one could perform 
more updates for the fast variables than for the slow variables. I will refer to these as "extra update 
Metropolis" methods. 

Another simple fast-slow method I devised (that has been found useful for the CMB modeling 
problem) is to randomly choose a displacement in the slow space, 5i G Xi, and then perform some 
pre-determined number of Metropolis updates in which the proposal from state (xi, X2) is {xi±6i, X2), 
where the sign before 5i is chosen randomly, and Xg is chosen from some suitable proposal distribution, 
perhaps depending on xi, X2, and the chosen sign for 61. During such a sequence of "random grid 
Metropolis" updates, all proposed and accepted states will have slow variables of the form xi + idi, 
where xi is the initial value, and i is an integer. If intermediate results of computations with all 
these values for the slow variables are saved, the number of slow computations may be much less than 
the number of Metropolis proposals, since the same slow variables may be proposed many times in 
conjunction with different values for the fast variables. 

If recomputation after a change to the fast variables is very much faster than after a change to 
the slow variables, we might hope to find an MCMC method that samples nearly as efficiently as 
would be possible if we could efficiently compute the marginal distribution for just the slow variables, 
7r(xi) = / 7r(xi, X2) dx2, since we could approximate this integral using many values of X2. I have 
devised a "dragging Metropolis" method (Neal, 2004) that can be seen as attempting to achieve this. 
The ensemble MCMC methods I describe next can also be viewed in this way. 

Applying ensemble MCMC to problems with fast and slow variables. Ensemble MCMC can 
be applied to this problem using ensembles in which all states have the same value for the slow variables, 
xi. Computation of the ensemble probability density (equation (fTOj) ) can then take advantage of quick 
computation for multiple values of the fast variables. Many sampling schemes of this sort are possible. 
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distinguished by the ensemble base measure used, and by the updates that are performed on the 
ensemble (ie, by the transition T). 

For all the schemes I have looked at, the fast and slow variables are independent in the base 
ensemble measure used, and the slow variables have the same value in all members of the ensemble. 
The ensemble base measure therefore has the following form: 



K 



k=2 



c(xW,...,xW) = kxW)j]5^,,(4^))i ^(4i),...,4^)) (18) 



Here, w(xi) is some density for the slow variables — the choice of uj will turn out not to matter, as 
long as it is nowhere zero. That the slow variables have the same values in all ensemble members is 

(k) 

expressed by the factors of {x\ ). The joint distribution of the fast variables in the K members of 
the ensemble is given by the density function ^. If we let be the marginal distribution for the fc'th 
member of the ensemble under ^, we can write the ensemble density from equation (jlOp as follows: 

p(x(^...,x(^)) = C(x(^\...,x(^^)^E?77^ (19) 

^ ^ Ckix^'')) 

K 1 K r (k)\ 

= [n*,,.(.f')]«4".....4"-')^E^ p«> 

k=2 k=l 'ik{X2 ) 

Possible choices for ^ include those analogous to some of the general choices for discussed in the 
previous section, in particular the following: 

independent 

Let Xg ^ be independent under ^, with all having the same marginal distribution. For 

example, if the fast variables are parameters in a Bayesian model, we might use their prior 
distribution. 

exchangeable 

Let 

^2 ^) • • • ''be exchangeable under ^. If X2 is scalar, we mi ght let x^2^\9 ~ iV(0,r2) and 
6 ~ N{0,tp'^). We could then let ip go to infinity, so that ^ is improper, and all are the same. 

grid points 

(k) (k) 

Let X2 = xl + 6 for k = 1, . . . ,K. Here, if X2 has dimension D, ^ is a D-dimensional offset 

(k) 

that is applied to the set of points xl . This is an example of a constellation, as discussed in the 
previous section. One possibility is a rectangular grid, with K = for some integer m, and 
with grid points spaced a distance dj in dimension j, so that the extent of the grid in dimension 
J is {m — l)dj. 

The ensemble update, T, could be done in many ways, as long as the update leaves p invariant. A 
simple and general option is to perform some number of Metropolis-Hastings updates (Metropolis, et al 
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1953; Hastings 1970), in which we propose to change the current ensemble state, y = {x'^^\ . . . , x^^^), to 
a proposed state, y* , drawn according to some proposal density g{ii*\y). We accept the proposed y* as 
the new state with probability min[ 1, p{y*) g{y\y*) / p{y) g{y* \y) ], where p is defined by equation ([20]) . 

(k) 

A technical issue arises when X is continuous — since x\ must be the same for all members of the 
ensemble, the probability densities will be infinite (eg, due to the delta functions in equation (f20]l ). We 
can avoid this by looking at densities for only x^^^ and x^\ . . . , x'^\ since x^^\ . . . , x[^^ are just equal 
to x'iK (This assumes that we never propose states violating this constraint.) A similar issue arises 

(k) 

when using an ensemble of grid points, as defined above, in which the are constrained to a grid 
— here also, we can simply leave out the infinite factors in the density that enforce the deterministic 
constraints, assuming that our proposals never violate them. 

I will consider two classes of Metropolis updates for the ensemble state, in which symmetry of the 
proposal makes g{y\y*)/g{y*\y) equal to one, so the acceptance probability is min[l, p{y*)/ p{y)]: 

fast variables fixed 

Propose a new value for xi in all ensemble members, by adding a random offset drawn from some 
symmetrical distribution to the current value of xi. Keep the values of x'^\ . . . , x'^'^ unchanged 
in the proposed state. 

fast variables shifted 

Propose a new value for xi in all ensemble members as above, together with new values for 
2^"* found by adding an offset to all of x^2^ , ■ ■ ■ , x^2^^ that is randomly drawn from 
some symmetrical distribution. 

These two schemes are illustrated in Figured) Either scheme could be combined with any of the three 




Figure 1: Illustration of ensemble updates with with one slow variable (horizontal) and one fast variable 
(vertical). The distribution is uniform over the outlined region. A grid ensemble with i^T = 10 is used. 
On the left, a move is proposed by changing only the slow variable; it will be accepted with probability 
1/4, since four members of the current ensemble have non-zero probability, compared with only one 
in the proposed ensemble. On the right, a move is proposed by changing the slow variable and also 
shifting the grid ensemble; it will be accepted with probability 2/4. (Of course, a less favourable shift 
in the grid ensemble might have led to zero probability of acceptance.) 
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types of ensembles above, but shifting the fast variables when they are sampled independently from 
some distribution seems unlikely to work well, since after shifting, the values for the fast variables 
would not be typical of the distribution. 

One could perform just one Metropolis update on the ensemble state before returning to a single 
point in X, or many such updates could be performed. Moving back to a single point and then 

regenerating an ensemble before performing the next ensemble update will requires some computation 
time, but may improve sampling. Updates might also be performed on X (perhaps changing only fast 
variables) before moving back to an ensemble. 

Fast and slow variables in Gaussian process regression models 

Gaussian process regression (Neal 1999; Rasmussen and Williams, 2006) is one context where ensemble 
MCMC can be applied to facilitate sampling when there are "fast" and "slow" variables. 

A brief introduction to Gaussian process regression. Suppose that we observe pairs (2;^*^ yW) 
for z = 1, . . . , n, where z^''^ is a vector of p covariates, whose distribution is not modeled, and y^*) the 
corresponding real-valued response, which we model as y^'^' = f{z^^) + e^, where / is an unknown 
function, and e^*-* is independent Gaussian noise with mean zero and variance o"^. We can give / a 
Gaussian process prior, with mean zero and some covariance function C{z,z') = E[f{z)f{z')]. We 
then base prediction for a future observed response, y*, associated with covariate vector z*, on the 
conditional distribution of y* given y = {y^^\ . . . , y^"^). This conditional distribution is Gaussian, with 
mean and variance that can be computed as follows: 

£;(y*|y) = rS-iy, \aiiy*\y) = v - k^^~'k (21) 

Here, S = K + a^I is the covariance matrix of y, with Kij = C{z^^\ z^^^). Covariances between y* and 
y are given by the vector k, with ki = C{z* , z^^^). The marginal variance of y* is v = C{z* , z*) + a"^. 

If the covariance function C and the noise variance cr^ are known, the matrix operations above are 
all that are needed for inference and prediction. However, in practice, the noise variance is not known, 
and the covariance function will also have unknown parameters. For example, one commonly-used 
covariance function has the form 

Ciz,z') = ry2(a2 + exp(-^(z.,(z,-4))')) (22) 

h=l 

Here, the term allows for the overall level of the function to be shifted upwards or downwards 
from zero; it might be fixed a priori. However, we typically do not know what are suitable values for 
the parameter r], which expresses the magnitude of variation in the function, or for the parameters 
z/i,...,fp, which express how relevant each covariate is to predicting the response (with 1/^ = 
indicating that z^ has no effect on predictions for y). 

In a fully Bayesian treatment of this model, ij, ly, and a are given prior distributions. Independent 
Gaussian prior distributions for the log of each parameter may be suitable, with means and variances 
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that are fixed a priori. In the models I fit below, the logs of the components of i' are correlated, 
which is appropriate when learning that one covariate is highly relevant to predicting y increases our 
belief that other covariates might also be relevant. After sampling the posterior distribution of the 
parameters with some MCMC method, predictions for future observations can be made by averaging 
the Gaussian distributions given by (j2ip over values of the parameters taken from the MCMC run. 

Sampling from the posterior distribution requires the likelihood for the parameters given the data, 
which is simply the Gaussian probability density for y with these parameters. The log likelihood, 
omitting constant terms, can therefore be written as 

logL{r],u,a) = -(1/2) log det S(r?, z/, a) - (l/2)y^E(7?, i., ct)-^ (23) 

This log likelihood is usually computed by finding the Cholesky decomposition of S, which is the 
lower-triangular matrix L for which LL^ = S. From L, both terms above are easily calculated. In 
the first term, the determinant of S can be found as (detL)^, with detL being just the product of 
the diagonal entries of L. In the second term, y'^Ti^^y = y^{LL'^)~^y = u^u, where u = L~^y can be 
found by forward substitution. 

Computing the Cholesky decomposition of S requires time proportional to n^, with a rather small 
constant of proportionality. Computation of S requires time proportional to n^p, with a somewhat 
larger constant of proportionality (which may be larger still when covariance functions more complex 
than equation (|22p are used). If p is fixed, the time for the Cholesky decomposition will dominate for 
large n, but for moderate n (eg, n = 100, p = 10), the time to compute S may be greater. 

Expressing computations in a form with fast variables. I will show here how the likelihood 
for a Gaussian process regression model can be rewritten so that some of the unknown parameters 
are "fast". Specifically, if computations are done using the Cholesky decomposition, rj can be a fast 
parameter, and if computations are instead done by finding the eigenvectors and eigenvalues of the 
covariance matrix, both rj and a can be fast parameters. Note that the MCMC methods used work 
better when applied to the logs of the parameters, so the actual fast parameters will be log{r]) and 
log(o"), though I sometimes omit the logs when referring to them here. 

The form of covariance function shown in equation (j22p has already been designed so that r] can 
be a fast parameter. Previously, I (and others) have used covariance functions in which the constant 
term a is not multiplied by the scale factor 77^, but writing it as in equation (|22p will be essential for 
r/ to be a fast parameter. As an expression of prior beliefs, it makes little difference whether or not 

is multiplied by •q'^, since a is typically chosen to be large enough that any reasonable shift of the 
function is possible, even if a is multiplied by a value of rj at the low end of its prior range. Indeed, 
one could let a go to infinity — analogous to letting the prior for the intercept term in an ordinary 
linear regression model be improper — without causing any significant statistical problem, although in 
practice, to avoid numerical problems from near-singularity of S, excessively large values of a should 
be avoided. A large value for a will be usually be unnecessary if the responses, y, are centred to have 
sample mean zero. 

To make r] a fast variable when using Cholesky computations, we also need to rewrite the contri- 
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bution of the noise variance to S as if times a"^ /"rf = ijj'^. We will then do MCMC with log (77) as 
a fast variable and \og{ij)) and log(z//j) for h = 1, ... ,p as slow variables. (Note that the Jacobian for 
the transformation from {\og{rj) ,\og{a)) to (log(?7), log('i/')) has determinant one, so no adjustment of 
posterior densities is required with this transformation.) 

Fast recomputation of the likelihood after rj changes can be done if we write S = rfiT + 4''^ I), 
where 

T,, = a' + exp(-X:(^^.(4^-4^'^))') (24) 

h=l 

T + ip'^I is not a function of log(ry), but only of log{'tp) and the log(i//i). Given values for these slow 
variables, we can compute the Cholesky decomposition of T + ip"^!, and from that det(T + ip'^I) and 
y^(T + 'ijj'^I)y, as described earlier. If these results are saved, for any value of the fast variable, \og{'q), 
we can quickly compute detS = 77'^"'det(T + ip'^) and y^T,~^y = y^{T + ip'^I)^^y / "rf , which suffice 
for computing the likelihood. 

Rather than use the Cholesky decomposition, we could instead compute the likelihood for a Gaus- 
sian process model by finding the eigenvectors and eigenvalues of S. Let E be the matrix with the 
eigenvectors (of unit length) as columns, and let Ai,...,An be the corresponding eigenvalues. The 
projections of the data on the eigenvectors are given by n = E'^y. The log likelihood of ()23p can then 
be computed as follows: 

logL = -^Y.^ogK - ij]^ (25) 

1=1 i=l * 

Both finding the Cholesky decomposition of an n x n matrix and finding its eigenvectors and eigenvalues 
take time proportional to , but the constant factor for finding the Cholesky decomposition is smaller 
than for finding the eigenvectors (by about a factor of fifteen), hence the usual preference for using 
the Cholesky decomposition. 

However, if computations are done using eigenvectors and eigenvalues, both r/ and a can be made 
fast variables. To do this, we write S = rfT + a"^!. Given values for the slow variables, \og{uh) for 
/i = 1, . . . ,p, we can compute T, and then find its eigenvalues, Ai, . . . , A^, and the projections of y on 
each of its eigenvectors, which will be denoted Ui for i = 1, . . . ,n. If we save these Aj and Uj, we can 
quickly compute L for any values of r/ and a. The eigenvectors of S are the same as those of T, and 
the eigenvalues of S are "rfXi + o"^ for i = 1, . . . , n. The log likelihood for the \og{vh) values used to 
compute T along with values for the fast variables log(r/) and log((T) can therefore be found as 

logL = -ij:,„,(,^A. + .^)-iE;^ (26) 

i=l i=l 

This procedure must be modified to avoid numerical difficulties when T is nearly singular, as can 
easily happen in practice. The fix is to add a small amount to the diagonal of T, giving T' = T + r^/, 
where r"^ is a small amount of "jitter", and then using T' in the computations described above. The 
statistical effect of this is that the noise variance is changed to •rfr'^ + o"^. By fixing r to a suitably 
small value, the difference of this from a"^ can be made negligible. For consistency, I also use T' for 
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the Cholesky method (so the Cholesky decomposition is of T' + tp'^I), even though adding jitter is 
necessary in that context only when tp'^ might be very smah. 

To summarize, if we use the Cholesky decomposition, we can compute the likelihood for all members 
of an ensemble differing only in r] using only slightly more time than is needed to compute the likeli- 
hood for one point in the usual way. Only a small amount of additional time, independent of n and p, 
is needed for each member of the ensemble. Computation of eigenvectors is about fifteen times slower 
than the Cholesky decomposition, and computing the likelihood for each ensemble member using these 
eigenvectors is also slower, taking time proportional to n. However, when using eigenvector compu- 
tations, both rj and a can be fast variables. Which method will be better in practice is unclear, and 
likely depends on the values of n and p — for moderate n and large p, time to compute the covariance 
matrix, which is the same for both methods, will dominate, favouring eigenvector computations that 
let 0" be a fast variable, whereas for large n and small p, the smaller time to compute the Cholesky 
decomposition may be the dominant consideration. 

If we wish to use a small ensemble over both rj and a, it may sometimes be desirable to do the 
computations using the Cholesky decomposition, recomputing it for every value of the fast variables. 
This would make sense only if the time to compute the covariances (apart from a final scaling by r/^ 
and addition of cr'^I) dominates the time for these multiple Cholesky computations (otherwise using 
the ensemble will not be beneficial), and the ensemble has less than about fifteen members (otherwise 
a single eigenvector computation would be faster). However, I do not consider this possibility in the 
demonstraton below, where larger ensembles are used. 

A demonstration 

Here, I demonstrate and compare standard Metropolis and ensemble MCMC on the problem of sam- 
pling from the posterior distribution for a Gaussian process regression model, using the fast-slow 
methods described above. The model and synthetic test data that I use were chosen to produce 
an interesting posterior distribution, having several modes that correspond to different degress of 
predictability of the response (and hence different values for a). 

The MCMC programs were written in R, and run on two machines, with two versions of R — version 
2.8.1 on a Solaris/SPARC machine, and version 2.12.0 on a Linux/Intel machine. The programs used 
are available at my web site0 

The synthetic data. In the generated data, the true relationship of the response, y, to the vector 
of covariates, z, was y = f{z) + e, with e being independent Gaussian noise with standard deviation 
0.4, and the regression function being 

f{z) = 0.7zl + 0.8sin(0.3-h(4.5-F0.5zi)z2) + 0.85 cos(0.1 5^3 O.lz^) (27) 

The number of covariates was p = 12, though as seen above, f{z) depends only on zi, Z2, and z^. 
The covariate values were randomly drawn from a multivariate Gaussian distribution in which all the 

^ http : //www . cs .utoronto . ca/~radf ord/ 
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Zh had mean zero and variance one. There were weak correlations among zi, Z2, and Z3. There were 
strong (0.99) correlations between zi and z^, Z2 and Z5, and Z3 and zq, and moderate (0.9) correlatons 
between zi and zj, Z2 and zg, and z^ and zg. Covariates ziq, zu, and Z12 were independent of each 
other and the other covariates H 

I generated n = 100 independent pairs of covariate vectors and response values in this way. 

The Gaussian process model. This synthetic data was fit with a Gaussian process regression 
model of the sort described in the previous section, in which the prior covariance between responses 
y(') and y^^^ was 

p 

Cov(y«, y(^)) = (a^ + exp (- ^ [u, (z« - 4^)))' ) + r%) + a% (28) 

h=l 

where 5ij is zero if i 7^ j and one if z = j. I fixed a = 1 and r = 0.01. 

The priors for rj, a, and u were independent. The prior for log(cr) was Gaussian with mean log(0.5) 
and standard deviation 1.5; that for log(r/) was Gaussian with mean of log(l) and standard deviation 
1.5. The prior for u was multivariate Gaussian with each log(i^;i) having mean log(0.5) and standard 
deviation 1.8, and with the correlation of and z^/j' for h ^ h' being 0.69. Having a non-zero 
prior correlation between the Vh is equivalent to using a prior expressed in terms of a higher-level 
hyperparameter, vq, conditional on which the Uh are independent with mean z/q. 

Performance of standard Metropolis sampling. I tried sampling from the posterior distribution 
for this model and data using standard random-walk Metropolis methods, with the state variables 
being the logs of the parameters of the covariance function (the twelve Uh parameters, rj, and a). 

I first tried using a multivariate Gaussian Metropolis proposal distribution, centred at the current 
state, with independent changes for each variable, with the same standard deviation — ie, the proposal 
distribution was N{x,s'^I), where x is the current state, and s is the proposal standard deviation. 

Figure [2] shows results with s = 0.25 (top) and s = 0.35 (bottom). The plots show three quantities, 
on a logarithmic scale, for every 300th iteration from a total of 300,000 Metropolis updates (so that 
1000 points are plotted). The quantity shown in red is the average of the I'h parameters (giving 
relevances of the covariates), that shown in green is the rj parameter (the overall scale), and that 
shown in blue is the a parameter (noise standard deviation). 

When s = 0.25, the rejection rate is 87%. From the top plot, it appears that the chain has converged, 
and is sampling from two modes, characterized by values of a around 0.5, or much smaller values. 
Movement between these modes occurs fairly infrequently, but an adequate sample appears to have 
been obtained in 300,000 iterations. 

^ In detail, the covariates were generated by letting Wh for /i = 1, . . . , 12 be independent standard normals, and then 
letti ng zi = w i, Z2 = 0.252i + W2VI - 0. 25^ Z3 = 0.2522 + W3VI - 0.25^ 24 = 0.99ziJ^^/T^^0^, zs = 0.99^2 + 
- 0.992, ^ O.9923+M6VI - 0.992, ^ 0.9zi+wrVl - 0.92, Z8 = 0.922+W8\/l - 0.92, 29 = O.Qzs + wWl - O.92, 
210 = wio, zii = wii, and 212 = Wi2- 
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Figure 2: Runs using Metropolis updates of all parameters at once, with two values for the proposal 
standard deviation, s. 

However, the bottom plot shows that this is not the case. The rejection rate here, with s = 0.35, 
is high, at 94%, but this chain explores additional modes that were missed in the run with s = 0.25. 
The sample from 300,000 iterations is nevertheless far from adequate. Some modes were moved to and 
from only once in the run, so an accurate estimate of the probality of each mode cannot be obtained. 

Metropolis updates that change only one parameter at a time, systematically updating all pa- 
rameters once per MCMC iteration, perform better. (This is presumably because in the posterior 
distribution there is fairly low dependence of some parameters on others, so that fairly large changes 
can sometimes be made to a single parameter.) Updating a parameter is slow, requiring about the 
same computation time as an update changing all parameters. However, updates that change only ij, 
or (if computations are done using eigenvectors) only a, will be fast. To allow a fair comparison, the 
number of iterations done was set so that the number of slow evaluations was 300,000, the same as 
for the runs with Metropolis updates of all variables at once. To make the comparison as favourable 
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Figure 3: Runs using Metropolis updates of one parameter at a time, with extra updates of fast 
parameters for the lower plot. 

as possible for this method, I assumed that computations are done using eigenvectors, so that both rj 
and a are fast variables. 

The top plot in Figure [3] shows results using such single-variable Metropolis updates (selected as 
among the best of runs with various settings that were tried). Here, the proposal standard deviation 
was 2 for the parameters, and 0.6 for the rj and a parameters. (Recall that all parameters are 
represented by their logs.) The rejection rates for the updates of the variables ranged from 53% to 
77%. One can see that sampling is significantly improved, but that some modes are still visited only a 
few times during the run. One way to try to improve sampling further is to perform additional updates 
of the fast variables. The bottom plot in Figure [3] shows the results when 49 additional Metropolis 
updates of the fast variables (ry and a) are done each iteration. This seems to produce little or no 
improvement on this problem. (However, on some other problems I have tried, extra updates of fast 
variables do improve sampling.) 
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The time required for a slow evaluation using computations based on the Cholesky decomposition 
was 0.0081 seconds on the Intel machine, and 0.0661 seconds on the SPARC machine. When using 
computations based on eigenvectors, a slow evaluation took 0.0128 seconds on the Intel machine, 
and 0.0886 seconds on the SPARC machine. The ratio of computation time using eigenvectors to 
computation time using the Cholesky decomposition was therefore 1.58 for the Intel machine and 1.34 
for the SPARC machine. Note that this ratio is much smaller than the ratio of times to compute 
eigenvectors versus the Cholesky decomposition, since times for other operations, such as computing 
the covariances, are the same for the two methods, and are not negligible in comparison when n = 100 
and p = 12. 

Performance of ensemble MCMC. I now present the results of using ensemble MCMC methods 
on this problem. The Metropolis method was used to update the ensemble, so that a direct comparison 
to the results above can be made, but of course other types of MCMC updates for the ensemble are 
quite possible. I tried Metropolis updates where the proposals both changed the slow variables and 
shifted the ensemble of values for fast variables, but keeping the ensemble of fast variables fixed seemed 
to work as well or better for this problem. Updating only one slow variable at a time worked better 
than updating all at once. 

Accordingly, the results below are for ensemble updates consisting of a sequence of single-variable 
Metropolis updates for each slow variable in turn, using Gaussian proposals centred at the current 
value for the variable being updated. After each such sequence of updates, the ensemble was mapped 
to a single state, and a new ensemble was then generated. (One could keep the same ensemble for many 
updates, but for this problem that seems undesirable, considering that regenerating the ensemble is 
relatively cheap.) 

I will first show the results when using computations based on eigenvectors, for which both rj and 
a are fast variables. Three ensembles for fast variables were tried — an independent ensemble, with 
the distribution being the Gaussian prior, an exchangeable ensemble, with the ensemble distribution 
being Gaussian with standard deviations 0.4 times the prior standard deviations, and a 7 x 7 grid 
ensemble, with grid extent chosen uniformly between the prior standard deviation and 1.1 times the 
prior standard deviation. All ensembles had K = 4:9 members. The number of iterations was set so 
that 300,000 slow evaluations were done, to match the runs above using standard Metropolis updates. 

Figure H] shows the results. Comparing with the results of standard Metropolis in Figure [3l one 
can see that there is much better movement among the modes, for all three choices of ensemble. 
The exchangeable and grid ensembles perform slightly better than the independent ensemble. I tried 
increasing the size of the ensemble to 400, and performing extra updates of the fast variables after 
mapping back to a single state, but this only slightly improves the sampling (mostly for the independent 
ensemble) . 

The time per iteration for these ensemble methods (with K = 49) was 0.0139 seconds on the Intel 
machine and 0.0936 seconds on the SPARC machine (very close for all three ensembles). This is slower 
than standard Metropolis using Cholesky computations by a factor of 1.7 for the Intel machine and 
1.4 for the SPARC machine. The gain in sampling efficiency is clearly much larger than this. 
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Figure 4: Runs using ensemble MCMC with computations based on eigenvalues (both rj and a fast), 
for three ensembles. 
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Figure [5] shows the results when computations are based on the Cholesky decomposition, so that 
only T] can be a fast variable. Sampling is much better than with standard Metropolis, but not as good 
as when both r] and a are fast variables. However, the computation time is lower — 0.0082 seconds per 
iteration for on the Intel machine, and 0.0666 seconds per iteration on the SPARC machine. These 
times are only about 1% higher than for standard Metropolis, so use of an ensemble for rj only is 
virtually free. 

Discussion 

I will conclude by discussing other possible applications of ensemble MCMC, and its relationship with 
two previous MCMC methods. 

Some other applications. Bayesian inference problems with fast and slow variables arise in many 
contexts other than Gaussian process regression models. As mentioned early, many Bayesian models 
have hyperparameters whose conditional distributions depend only on a fairly small number of pa- 
rameters, and which are therefore fast compared to parameters that depend on a large set of data 
points. 

For example, in preliminary experiments, I have found a modest benefit from ensemble MCMC 
for logistic regression models in which the prior for regression coefficients is a i distribution, whose 
width parameter and degrees of freedom are hyperparameters. When there are n observations and 
p covariates, recomputing the posterior density after a change only to these hyperparameters takes 
time proportional to p, whereas recomputing the posterior density after the regression coefficients 
change takes time proportional to np (or to n, if only one regression coefficient changes, and suitable 
intermediate results were retained). The hyperparameters may therefore be seen as fast variables. 

In Gaussian process models with latent variables, such as logistic classification models (Neal 1999), 
the latent variables are all fast compared to the parameters of the covariance function, since recom- 
puting the posterior density for the n latent variables, given the parameters of the covariance function, 
takes time proportional n^, with a small constant factor, once the Cholesky decomposition of their 
covariance matrix has been found in time. Looking at ensemble MCMC for this problem would be 
interesting, though the high dimensionality of the fast variables may raise additional issues. 

MCMC for state-space time series models using embedded Hidden Markov models (Neal, Beal, and 
Roweis, 2004) can be interpreted as an ensemble MCMC method that maps to the ensemble and 
then immediately maps back. Looked at this way, one might consider updates to the ensemble before 
mapping back. Perhaps more promising, though, is to use a huge ensemble of paths defined using an 
embedded Hidden Markov model when updating the parameters that define the state dynamics. 

Relationship to the multiple-try Metropolis method. A Metropolis update on an ensemble 
of points as defined in this paper bears some resemblence to a "multiple-try" Metropolis update as 
defined by Liu, Liang, and Wong (2000) — for both methods, the update is accepted or rejected based 
on the ratio of two sums of terms over K points, and in certain cases, these sums are of 7r(x(*^) for the 
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Figure 5: Runs using ensemble MCMC with computations based on Cholesky decomposition (only rj 
fast), for three ensembles. 
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K points, x^^\ . . . ,x^^\ In a multiple-try Metropolis update, K points are sampled independently 
from some proposal distribution conditional on the current point, one of these K proposed points is 
then selected to be the new point if the proposal is accepted, and a set of K points is then produced by 
combining the current point with K — 1 points sampled independently from the proposal distribution 
conditional on this selected point. Finally, whether to accept the selected point, or reject it (retaining 
the current point for another iteration), is decided by a criterion using a ratio of sums of terms for 
these two sets of K points. 

In one simple situation, multiple-try Metropolis is equivalent to mapping from a single point to an 
ensemble of K points, doing a Metropolis update on this ensemble, and then mapping back to a single 
point. This equivalence arises when the proposal distribution for multiple-try Metropolis does not 
actually depend on the current state, in which case it can also be used as an independent ensemble 
base distribution, and as a proposal distribution for an ensemble update in which new values for all 
ensemble members are proposed independently]! This method will generally not be useful, however, 
since there is no apparent short-cut for evaluating it{x) at the K ensemble points (or K proposed 
points) in less than K times the computational cost of evaluating tt{x) at one point. 

Applying multiple-try Metropolis usefully to problems with fast and slow variables, as done here for 
ensemble MCMC, would require that the K proposals conditional on the current state be dependent, 
since for fast computation they need to all have the same values for the slow variables. Liu, et al. 
mention the possibility of dependent proposals, but provide details only for a special kind of dependence 
that is not useful in this context. 

In this paper I have emphasized the need for a computational short-cut if ensemble MCMC is to 
provide a benefit, since otherwise it is hard to see how an ensemble update taking a factor of K more 
computation time can outperform K ordinary updates. The analogous point regarding multiple-try 
Metropolis was apparently not appreciated by Liu, et al., as none of the examples in their paper make 
use of such a short-cut. Accordingly, in all their examples one would expect a multiply-try Metropolis 
method with K trials to be inferior to simply performing K ordinary Metropolis updates with the 
same proposal distribution, a comparison which is not presented in their paper0 



Relationship to the multiset sampler. Leman, Chen, and Lavine (2009) proposed a "multiset 
sampler" which can be seen as a particular example of sampling from an ensemble distribution as in 
this paper. In their notation, they sample from a distribution for a value x and a multiset of k values 
for y, written as s = {yi, . . . , y^}. The y values are confined to some bounded region, y, which allows 
a joint density for x and s to be defined as follow: 

k k 
7r*(x,s) a ^7r(x,yj) = tt{x) ^'K{yj\x) (29) 
0=1 i=i 

^In detail, using the notation of Liu, et al. (2000), we obtain this equivalence by letting T{x, y) = C^{y) and \{x, y) — 
V[C(2;)C(y)], so that w{x,y) = Tr{x)T{x,y)X{x,y) = -!t{x)/C{x). 

*For their CGMC method, the K ordinary Metropolis updates should use the same reference point. This is valid, 
since the reference point is determined by a part of the state that remains unchanged during these K updates. 
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where 7r{x,y) is the distribution of interest. In some examples, they focus on the marginal tt{x). 
Integrating the density above over yi, . . . ,7/^ shows that the marginal 7r*{x) is the same as 7t{x), so 
that sampling x and s from vr* will produces a sample of x values from vr. Leman, et al. sample from 
IT* by alternating a Metropolis update for just x with with a Metropolis update for yj with j selected 
randomly from {1, . . . , A;}. 

This vr* distribution for the multiset sampler is the same as the ensemble distribution that would 
be obtained using the methods for fast and slow variables in this paper, if x is regarded as slow (and 
hence written as xi in the notation of this paper), y is regarded as fast (and hence written as X2), and 
in the ^ density used in defining the ensemble base measure, the fast variables in the ensemble are 
independent, with uniform density over 3^. The density p defined by equation (j20p then corresponds 
to the multiset density above. 

Leman, et al. do not distinguish between fast and slow variables, however, and hence do not argue 
that the multiset sampler would be beneficial in such a context. Nor do they assume that there is 
any other computational short-cut allowing vr to sometimes be evaluated quickly, thereby reducing the 
amount of computation needed to evaluate vr*. They instead justify their multiset sampler as allowing 
A; — 1 of the y values in the multiset to freely explore the region y, since the one remaining y value 
can provide a reasonably high Tr{x,y) and hence also a reasonably high value for vr*. 

The development in this paper confirms this picture. With the p distribution corresponding to 
multiset sampling, the mapping T will (in the notation of Leman, et al.) go from a single {x,y) pair 
drawn from vr(x,y) to an ensemble with k values for y, one of which is the original y, and the other 
A; — 1 of which are drawn independently and uniformly from y. This is therefore the equilibrium 
distribuiton of the multiset samplier. The development here also shows that 7r(x, y) can be recovered 
by applying the T mapping, which will select a y from the multiset with probabilities proportional to 
vr(x,y). This makes the complex calculations in Section 6 of (Leman, et al, 2002) unnecessary. 

Unfortunately, it also seems that any benefit of the multiset sampler can be obtained much more 
simply and efficiently with a Markov chain sampler that randomly chooses between two Metropolis- 
Hastings updates on the original vr distribution — one that proposes a new value for x (from some 
suitable proposal distribution) with y unchanged, and another that proposes a new value for x along 
with a new value for y that is drawn uniformly from y. As is the case as well for ensemble MCMC 
and multiple-try Metropolis, one should expect that looking at K points at once will produce a benefit 
only when a short-cut allows vr for all these points to be computed in less than K times the cost of 
computing vr for one point. 
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